Importing Data

suppressPackageStartupMessages({
  library(tidyverse)   # attaches dplyr, ggplot2, tidyr, readr
  library(plotly)
  library(maps)
  library(mapdata)
  library(RColorBrewer)
  library(sf)
  library(tigris)
  library(leaflet)
  library(leaflet.extras)
  library(viridis)
  library(tidytext)
  library(wordcloud)
  library(lubridate)
})

Topic 1: Dating

Overview

Imagine you are a data scientist at a respected media outlet – say the “New York Times”. Your editor wants to support the writing of a feature article about How Couples Meet and Stay Together. Your editor-in-chief asks you to analyze some data from an extensive survey conducted by a researcher at Stanford University.

Since there is no way that all features of the data can be represented in such a memo, feel free to pick and choose some patterns that would make for a good story – outlining important patterns and presenting them in a visually pleasing way.

The full background and text of the story will be researched by a writer of the magazine – your input should be based on the data and some common sense (i.e. no need to read up on this). It does help, however, to briefly describe what you are presenting and what it highlights.

Provide polished plots that are refined enough to include in the magazine with very little further manipulation (already include variable descriptions [if necessary for understanding], titles, source [e.g. “How Couples Meet and Stay Together (Rosenfeld, Reuben, Falcon 2018)”], appropriate colors, fonts etc.) and are understandable to the average reader of the “New York Times”. The design does not need to be NYTimes-like.

Data

We will be using the 2017 wave of the HCMST survey - provided as HCMST_couples.rds. The file HCMST_variable_descriptions.pdf contains most of the variable descriptions and coding of responses.

dating_data <- readRDS("/Users/ulrika/Documents/GitHub/course_content/Exercises/03_dating_GRADED/HCMST_couples.rds")
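After loading, a quick structural check against the variable-description PDF helps confirm the coding of responses (a minimal sketch; output omitted):

```r
# Sanity checks against HCMST_variable_descriptions.pdf:
# dimensions, variable names, and column types of the 2017 wave
dim(dating_data)
glimpse(dating_data)
```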

2. Age is just a number

Create one (1) visualization to show the relationship between a respondent’s age and their partner’s age, accounting for the gender of the respondent. Identify the main pattern in the graph via an annotation directly added to the plot.

dating_data %>% 
  filter(!is.na(ppage), !is.na(Q9), !is.na(ppgender)) %>%
  ggplot(aes(x = ppage, y = Q9, color = ppgender)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red2") +
  scale_color_manual(values = c("Male" = "lightskyblue", "Female" = "pink1")) +
  labs(title = "Relationship Between Respondent's and Partner's Age",
       subtitle = "Accounting for Gender of the Respondent",
       x = "Respondent's Age",
       y = "Partner's Age",
       color = "Gender",
       caption = "Source: How Couples Meet and Stay Together (Rosenfeld, Reuben, Falcon 2018)") +
  theme_minimal() +
  annotate("text", x = Inf, y = 0,
           label = paste("Across genders, there is a strong positive correlation",
                         "between the ages of respondents and their partners,",
                         "with a tendency for individuals to partner with someone",
                         "of a similar age.", sep = "\n"),
           size = 3, color = "indianred3", hjust = 1, vjust = 0)

3. Politics and Dating

Explore how the political affiliation of partners affects how couples meet and stay together.

Create two (2) charts in this section to highlight the information on politics and dating. Make sure to use some variation in the type of visualizations. Discuss which visualization you recommend to your editor and why.

# Graph 1: Bar Chart
dating_data %>%
  filter(!is.na(partyid7), !is.na(meeting_type_simplified)) %>%
  ggplot(aes(x = meeting_type_simplified, fill = partyid7)) +
  geom_bar(position = "dodge") +
  labs(title = "Meeting Modes by Political Affiliation",
       x = "Meeting Mode",
       y = "Count",
       fill = "Political Affiliation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "RdBu")

# Graph 2: Heat Map
avg_duration <- dating_data %>%
  filter(partyid7 != "Refused", w6_q12 != "Refused") %>%
  group_by(partyid7, w6_q12) %>%
  summarise(average_duration = mean(duration, na.rm = TRUE), .groups = 'drop')

ggplot(avg_duration, aes(x = partyid7, y = w6_q12, fill = average_duration)) +
  geom_tile() +
  scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "YlOrRd")) +
  labs(title = "Average Relationship Duration by Political Affiliation Pairings",
       x = "Respondent's Political Group",
       y = "Partner's Political Group",
       fill = "Average Duration (days)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Recommendation

I recommend the heat map for the feature article. It goes beyond the initial meeting and examines relationship duration in the context of political alignment, offering insight into how shared or differing political views might influence relationship longevity. The bar chart usefully shows how couples meet, but the heat map speaks to both halves of "how couples meet and stay together," which makes it the better fit for a feature article on politics and dating.

4. Make two plots interactive

Choose 2 of the plots you created above and add interactivity. For at least one of these interactive plots, this should not be done through the use of ggplotly. Briefly describe to the editor why interactivity in these visualizations is particularly helpful for a reader.

# Aggregate the number of couples by the year they met and the (simplified)
# meeting venue before plotting
data_aggregated <- dating_data %>%
  filter(!is.na(Q21A_Year), !is.na(meeting_type_simplified)) %>%
  count(Q21A_Year, meeting_type_simplified, name = "Frequency")

interactive_line <- plot_ly(data_aggregated, x = ~Q21A_Year, y = ~Frequency, 
                            type = 'scatter', mode = 'lines',
                            color = ~meeting_type_simplified,
                            colors = RColorBrewer::brewer.pal(8, "Set3")) %>%
  layout(title = "How Couples Meet Over Time: Trends in Meeting Venues",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Number of Couples"),
         hovermode = "closest")

interactive_line
filtered_data <- dating_data %>%
  filter(!is.na(ppage), !is.na(Q9), !is.na(ppgender))

interactive_scatter <- plot_ly(data = filtered_data, x = ~ppage, y = ~Q9, 
                               type = 'scatter', mode = 'markers',
                               color = ~ppgender, colors = c('Male' = 'lightskyblue', 'Female' = 'pink1'),
                               marker = list(opacity = 0.6)) %>%
  layout(title = "Relationship Between Respondent's and Partner's Age",
         xaxis = list(title = "Respondent's Age"),
         yaxis = list(title = "Partner's Age"),
         hovermode = 'closest')

interactive_scatter

Interactivity allows readers to engage directly with the data in ways static charts do not. Users can hover over specific points to get precise information, such as the exact number of couples meeting through different venues each year or the exact ages of respondents and their partners. This direct engagement facilitates a deeper understanding of the data, enabling readers to discover insights that might not be immediately apparent from a static representation. Interactive elements empower readers to focus on aspects of the data that interest them most. For example, they can zoom in on specific time periods in the line chart to examine trends in greater detail or filter the scatter plot by gender to compare trends across different groups.
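For reference, the brief permits one of the two interactive plots to go through ggplotly. A minimal sketch of that route, rebuilding the static scatter from section 2 (the custom `text` tooltip is an illustrative choice, not part of the submitted plots):

```r
# Build the ggplot with a hand-rolled tooltip, then convert with ggplotly
p <- dating_data %>%
  filter(!is.na(ppage), !is.na(Q9), !is.na(ppgender)) %>%
  ggplot(aes(x = ppage, y = Q9, color = ppgender,
             text = paste("Respondent:", ppage, "<br>Partner:", Q9))) +
  geom_point(alpha = 0.6) +
  labs(x = "Respondent's Age", y = "Partner's Age", color = "Gender")

ggplotly(p, tooltip = "text")
```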

Topic 2: Severe Weather

Overview

We are going to investigate severe weather events and their impact. Using data about the locations of weather events occurring in the United States, we want to better understand and visualize their spatial distribution.

Data

NOAA’s National Centers for Environmental Information (NCEI) has data on all severe storm events.

severe_weather_data <- read_csv("/Users/ulrika/Documents/GitHub/course_content/Exercises/07_severe_weather_GRADED/data/storms.csv")
## Rows: 380137 Columns: 49
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): STATE, MONTH_NAME, EVENT_TYPE, CZ_TYPE, CZ_NAME, WFO, BEGIN_DATE_T...
## dbl (26): BEGIN_YEARMONTH, BEGIN_DAY, BEGIN_TIME, END_YEARMONTH, END_DAY, EN...
## lgl  (5): CATEGORY, TOR_OTHER_WFO, TOR_OTHER_CZ_STATE, TOR_OTHER_CZ_FIPS, TO...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1. Damage from Storms

a) State Level Choropleth Maps

Provide a static state-level choropleth map of the United States visualizing where monetary damage is recorded (by using the sum of the variables DAMAGE_PROPERTY_USD and DAMAGE_CROPS_USD).

state_data <- severe_weather_data %>%
  group_by(STATE) %>%
  summarise(TOTAL_DAMAGE = sum(DAMAGE_PROPERTY_USD + DAMAGE_CROPS_USD, na.rm = TRUE)) %>%
  mutate(STATE = tolower(STATE))

us_map <- map_data("state")

# Use left_join + arrange rather than merge(): merge() re-sorts rows,
# which scrambles the point order of the polygon outlines
state_map <- us_map %>%
  left_join(state_data, by = c("region" = "STATE")) %>%
  arrange(group, order)

state_labels <- state_map %>%
  group_by(region) %>%
  summarise(center_long = mean(long), center_lat = mean(lat), .groups = 'drop')

ggplot(data = state_map, aes(x = long, y = lat, group = group, fill = TOTAL_DAMAGE)) +
  geom_polygon(color = "white") +
  expand_limits(x = us_map$long, y = us_map$lat) +
  scale_fill_viridis(option = "viridis", direction = -1, name = "Total Damage (USD)") +
  geom_text(data = state_labels, inherit.aes = FALSE,
            aes(label = region, x = center_long, y = center_lat),
            size = 2, check_overlap = TRUE, colour = "black") +
  labs(title = "State-Level Monetary Damage from Storms in the US", x = "", y = "") +
  theme_minimal() +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.background = element_blank())

b) County Choropleth Maps

Provide a static county-level choropleth map of the United States visualizing where monetary damage is recorded (by using the sum of the variables DAMAGE_PROPERTY_USD and DAMAGE_CROPS_USD).

# CZ_FIPS alone repeats across states, so join on the full 5-digit GEOID
# (2-digit state FIPS + 3-digit county FIPS) and keep only county-type
# records (CZ_TYPE == "C"), excluding forecast zones
county_damage <- severe_weather_data %>%
  filter(CZ_TYPE == "C") %>%
  mutate(GEOID = paste0(sprintf("%02d", STATE_FIPS), sprintf("%03d", CZ_FIPS))) %>%
  group_by(GEOID) %>%
  summarise(Total_Damage = sum(DAMAGE_PROPERTY_USD + DAMAGE_CROPS_USD, na.rm = TRUE))

counties_sf <- tigris::counties(cb = TRUE, class = "sf")
## Retrieving data for the year 2022
county_damage_sf <- counties_sf %>%
  left_join(county_damage, by = "GEOID")

ggplot(data = county_damage_sf) +
  geom_sf(aes(fill = Total_Damage), color = "snow4") +
  scale_fill_viridis_c(option = "viridis", direction = -1, 
                       na.value = "ivory", name = "Total Damage (USD)") +
  labs(title = "County-Level Monetary Damage from Storms in the US") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_sf(xlim = c(-125, -65), ylim = c(25, 50), expand = FALSE)

2. Location of Severe Events

a) Interactive Map of Severe Weather Events

Create a leaflet map of the United States showing the location of severe weather events which result in at least one death (hint: use EVENT_TYPE). Ignore locations that fall outside the United States. Provide at least three pieces of information on the incident in a popup.

filtered_data <- severe_weather_data %>%
  filter(DEATHS_DIRECT > 0 | DEATHS_INDIRECT > 0) %>%
  select(EVENT_TYPE, BEGIN_LAT, BEGIN_LON, BEGIN_DATE_TIME, STATE, DEATHS_DIRECT, DEATHS_INDIRECT) %>%
  na.omit()

leaflet(data = filtered_data) %>%
  addTiles() %>%
  addMarkers(
    lng = ~BEGIN_LON, lat = ~BEGIN_LAT,
    popup = ~paste(EVENT_TYPE, "<br>", 
                   "Date: ", BEGIN_DATE_TIME, "<br>", 
                   "State: ", STATE, "<br>",
                   "Direct Deaths: ", DEATHS_DIRECT, "<br>",
                   "Indirect Deaths: ", DEATHS_INDIRECT)
  ) %>%
  addProviderTiles(providers$Esri.WorldImagery) %>%
  setView(lng = -98.583333, lat = 39.833333, zoom = 4)

b) Color by Type of Weather Event

Start with the previous map. Now, distinguish the markers of the weather event locations by EVENT_TYPE, i.e. what kind of weather event occurred. If there are too many categories, collapse some categories. Choose an appropriate coloring scheme to map the locations by type of weather event. Add a legend informing the user about the color scheme. Also make sure that the information about the type of weather event is now contained in the popup information. Show this map.

unique(filtered_data$EVENT_TYPE)
##  [1] "Flood"                    "Thunderstorm Wind"       
##  [3] "Flash Flood"              "Lightning"               
##  [5] "Tornado"                  "Heavy Rain"              
##  [7] "Hail"                     "Waterspout"              
##  [9] "Marine Thunderstorm Wind" "Marine Strong Wind"      
## [11] "Debris Flow"
filtered_data <- filtered_data %>%
  mutate(simplified_category = case_when(
    EVENT_TYPE %in% c("Thunderstorm Wind", "Lightning", "Tornado", "Hail",
                      "Marine Thunderstorm Wind", "Marine Strong Wind") ~ "Severe Storms",
    EVENT_TYPE %in% c("Flood", "Flash Flood", "Debris Flow") ~ "Floods",
    EVENT_TYPE == "Heavy Rain" ~ "Heavy Precipitation",
    EVENT_TYPE == "Waterspout" ~ "Marine Events",
    TRUE ~ "Other"
  ))

color_vector <- c("#FFB6C1", "#40E0D0", "#FF7F50", "#98FB98")
unique_categories <- unique(filtered_data$simplified_category)
category_palette <- colorFactor(palette = color_vector, domain = unique_categories)
leaflet(data = filtered_data) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~BEGIN_LON, lat = ~BEGIN_LAT,
    color = ~category_palette(simplified_category),
    popup = ~paste(EVENT_TYPE, "<br>",
                   "Category: ", simplified_category, "<br>",
                   "Date: ", BEGIN_DATE_TIME, "<br>",
                   "State: ", STATE, "<br>",
                   "Direct Deaths: ", DEATHS_DIRECT, "<br>",
                   "Indirect Deaths: ", DEATHS_INDIRECT),
    radius = 5,
    fillOpacity = 0.8
  ) %>%
  addLegend("bottomright", pal = category_palette, values = ~simplified_category,
            title = "Event Category",
            opacity = 1) %>%
  addProviderTiles(providers$Esri.WorldImagery) %>%
  setView(lng = -98.583333, lat = 39.833333, zoom = 4)

c) Cluster

Add marker clustering, so that zooming in will reveal the individual locations but the zoomed out map only shows the clusters. Show the map with clusters.

leaflet(data = filtered_data) %>%
  addTiles() %>%
  addMarkers(
    lng = ~BEGIN_LON, lat = ~BEGIN_LAT,
    clusterOptions = markerClusterOptions(),
    popup = ~paste(EVENT_TYPE, "<br>",
                   "Category: ", simplified_category, "<br>",
                   "Date: ", BEGIN_DATE_TIME, "<br>",
                   "State: ", STATE, "<br>",
                   "Direct Deaths: ", DEATHS_DIRECT, "<br>",
                   "Indirect Deaths: ", DEATHS_INDIRECT)
  ) %>%
  addLegend("bottomright", pal = category_palette, values = ~simplified_category,
            title = "Event Category",
            opacity = 1) %>%
  addProviderTiles(providers$Esri.WorldImagery) %>%
  setView(lng = -98.583333, lat = 39.833333, zoom = 4)

Topic 3: Movie Scripts

Overview

Explore the cinematic world through the lens of movie scripts, analyzing narrative structures, character prominence, and thematic elements across genres and periods. This assignment involves dissecting movie scripts to uncover patterns and trends, employing data visualization techniques to present findings.

Data

We will work with a dataset of approximately 1000 movie scripts and their metadata. This includes the movie’s title, release date, a brief overview, and parsed script files distinguishing dialogue, character information, and scene descriptions. The data was scraped with Aveek Saha’s code.

tagged <- read_csv("/Users/ulrika/Documents/GitHub/course_content/Exercises/09_moviescripts_GRADED/data/tagged.csv.gz", show_col_types = FALSE)
metadata <- read_csv("/Users/ulrika/Documents/GitHub/course_content/Exercises/09_moviescripts_GRADED/data/metadata.csv", show_col_types = FALSE)

1. Dialogue

a) Most Common Words

Analyze the dialogue content of movie scripts. Transform the dialogue into a tidy data frame, breaking down the text into individual words, removing common stop words and other unnecessary elements. As needed, use the cleaning functions introduced in lecture (or write your own in addition) to remove unnecessary words (stop words), syntax, punctuation, numbers, white space, etc. Visualize the 20 most frequently used words in the dialogues to gain insights into the core thematic elements of the scripts.

dialogues <- tagged %>%
  filter(Tag == "Dialogue")

tidy_dataframe <- dialogues %>%
  mutate(Content = str_replace_all(tolower(Content), "[^\\w\\s']", " "),
         Content = str_squish(Content)) %>%
  unnest_tokens(word, Content) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

tidy_dataframe %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = sqrt(n))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_viridis_c(direction = -1) +
  labs(x = "Word", y = "Frequency", title = "Top 20 Most Common Words in Movie Script Dialogues") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 12),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    axis.line = element_line(color = "grey", linewidth = 0.5),
    plot.margin = margin(1, 1, 1, 1, "cm")
  )

b) Word Cloud

Create word clouds for a selection of movies to visualize the most prevalent themes or phrases. Choose a set of movies, either randomly or based on specific criteria, and use their script dialogues to generate the clouds.

tidy_dataframe_with_id <- dialogues %>%
  select(movie_id, Content) %>%
  mutate(Content = tolower(Content),
         Content = gsub("[^a-z']", " ", Content),
         Content = gsub("\\s+", " ", Content)) %>%
  unnest_tokens(word, Content) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(movie_id, word) %>%
  summarise(n = n(), .groups = 'drop') %>%
  arrange(desc(n))

set.seed(123)
unique_movie_ids <- unique(tidy_dataframe_with_id$movie_id)
selected_movie_ids <- sample(unique_movie_ids, 2)

# Note: the loop variable must not share a name with the movie_id column,
# otherwise filter(movie_id == movie_id) compares the column to itself and
# every "cloud" is built from all movies combined
for (id in selected_movie_ids) {
  movie_dialogues <- tidy_dataframe_with_id %>%
    filter(movie_id == id) %>%
    select(word, n)
  
  cat("Word Cloud", id, "\n")
  wordcloud(
    words = movie_dialogues$word,
    freq = movie_dialogues$n,
    min.freq = 10,
    max.words = 150,
    random.order = FALSE,
    rot.per = 0.35,
    colors = brewer.pal(8, "Dark2"),
    scale = c(4, 0.5)
  )
  cat("\n---\n")
}
## Word Cloud Monkeybone_parsed

## 
## ---
## Word Cloud Ninotchka_parsed

## 
## ---

c) Profanity

Using this list of profanities calculate a profanity score for each movie, indicating how often these words were used in the script. Visualize the Top 10 movies with most profanity, and show how the use of profanity has changed over time (using the movie release date).

profanities_link <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
profanities <- readLines(con = profanities_link, warn = FALSE)

median_popularity <- median(metadata$popularity, na.rm = TRUE)

tidy_dataframe_with_id <- tidy_dataframe_with_id %>%
  mutate(movie_name = sub("_parsed", "", movie_id, fixed = TRUE))

metadata <- metadata %>%
  mutate(success = ifelse(popularity >= median_popularity, "Successful", "Unsuccessful"))

metadata <- metadata %>%
  mutate(movie_name = sub("_parsed_parsed_parsed", "", file_name, fixed = TRUE))

merged_data <- tidy_dataframe_with_id %>%
  inner_join(metadata, by = "movie_name")

profanity_movies <- merged_data %>%
  filter(word %in% profanities)

profanity_scores <- profanity_movies %>%
  group_by(movie_name) %>%
  summarize(profanity_score = n(), .groups = 'drop')

profanity_scores <- profanity_scores %>%
  inner_join(metadata %>% select(movie_name, release_date, success), by = "movie_name")

top_10_profanity <- profanity_scores %>%
  arrange(desc(profanity_score)) %>%
  slice_head(n = 10)

top_10_profanity$movie_name_modified <- gsub("-", " ", top_10_profanity$movie_name)

# Top 10 Movies with Most Profanity
ggplot(top_10_profanity, aes(x = reorder(movie_name_modified, profanity_score), y = profanity_score, fill = profanity_score)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_c(direction = -1) +
  labs(x = "Movie Name", y = "Profanity Score", title = "Top 10 Movies by Profanity Score") +
  theme_minimal() +
  theme(legend.position = "none")

# Profanity Over Time
profanity_scores$year <- format(as.Date(profanity_scores$release_date), "%Y")

profanity_over_time <- profanity_scores %>%
  group_by(year) %>%
  summarize(total_profanity = sum(profanity_score), .groups = 'drop') %>%
  arrange(year)

profanity_over_time$year_numeric <- as.numeric(profanity_over_time$year)

# Generate the plot with adjustments
ggplot(profanity_over_time, aes(x = year_numeric, y = total_profanity)) +
  geom_line() +
  geom_point(color = "royalblue") +  # A shade of blue for the dots
  scale_x_continuous(breaks = seq(min(profanity_over_time$year_numeric, na.rm = TRUE),
                                  max(profanity_over_time$year_numeric, na.rm = TRUE), by = 5),
                     labels = function(x) as.character(x)) +
  labs(x = "Year", y = "Total Profanity Score", title = "Profanity in Movies Over Time") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

d) Simplicity is a Virtue

Examine the impact of script simplicity on its success by calculating a readability score (Flesch Reading Ease, Flesch-Kincaid, or other measures) for the scripts. Analyze and visualize the relationship between the readability of the scripts and their IMDb vote average, providing commentary on your findings.

library(quanteda)
## Package version: 3.3.1
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)

dialogues <- dialogues %>% 
  filter(!is.na(Content))

readability_scores <- dialogues %>%
  group_by(movie_id) %>%
  summarize(script = paste(Content, collapse = " ")) %>%
  mutate(flesch_score = textstat_readability(script, measure = "Flesch")[, "Flesch"])

readability_scores <- readability_scores %>%
  mutate(movie_name = sub("_parsed", "", movie_id, fixed = TRUE))

movie_readability <- readability_scores %>%
  inner_join(metadata, by = "movie_name") %>%
  mutate(genres = str_split(genres, ",\\s*")) %>%
  unnest(genres)

movie_readability <- movie_readability %>%
  group_by(movie_id) %>%
  summarize(mean_flesch_score = mean(flesch_score, na.rm = TRUE),
            mean_vote_average = mean(vote_average, na.rm = TRUE))

ggplot(movie_readability, aes(x = mean_flesch_score, y = mean_vote_average)) +
  geom_point(shape = 19, size = 2, color = "skyblue3") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "tan1", linewidth = 1) +
  labs(x = "Flesch Reading Ease Score", y = "IMDb Vote Average", title = "Script Readability vs. IMDb Vote Average") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    panel.grid.major = element_line(color = "snow3"),
    panel.grid.minor = element_blank(),
    plot.background = element_rect()
  )

A higher Flesch Reading Ease score indicates a text that is easier to read. I found a positive correlation between the Flesch score and IMDb vote average, which suggests that more readable scripts tend to be associated with higher-rated movies.
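To quantify the claim rather than rely on the fitted line alone, the correlation can be computed directly from the movie_readability data frame built above (a sketch; cor.test drops incomplete pairs automatically):

```r
# Pearson correlation between mean Flesch score and mean IMDb vote average,
# with a significance test on the estimate
cor.test(movie_readability$mean_flesch_score,
         movie_readability$mean_vote_average)
```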

2. Genres

a) Emotions

Now, use the NRC Word-Emotion Association Lexicon in the tidytext package to identify a larger set of emotions (e.g. anger, anticipation, disgust, fear, joy, sadness, surprise, trust). Visualize the relationship between the use of words from these categories and the movie genre. What is your finding?

dialogues <- dialogues %>%
  mutate(movie_name = sub("_parsed", "", movie_id, fixed = TRUE))

tidy_dialogues <- dialogues %>%
  mutate(Content = str_replace_all(tolower(Content), "[^\\w\\s']", " "),
         Content = str_squish(Content)) %>%
  unnest_tokens(word, Content) %>%
  anti_join(stop_words, by = "word")

combined_data <- tidy_dialogues %>%
  inner_join(metadata, by = "movie_name") %>%
  mutate(genres = str_split(genres, ",\\s*")) %>%
  unnest(genres)

combined_data <- combined_data %>% filter(!is.na(genres))

nrc_lexicon <- get_sentiments("nrc")

tidy_emotions <- combined_data %>%
  inner_join(nrc_lexicon, by = "word")

emotions_by_genre <- tidy_emotions %>%
  count(genres, sentiment) %>%
  group_by(genres) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

ggplot(emotions_by_genre, aes(x = genres, y = prop, fill = sentiment)) +
  geom_bar(stat = "identity") +
  labs(title = "Proportion of Emotions by Genre",
       x = "Genre", y = "Proportion",
       fill = "Emotion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_viridis_d(option = "plasma")

For this visualization, I used proportions instead of raw counts so that genres of very different sizes can be compared directly. My analysis shows that surprise and joy are among the least commonly expressed emotions across the movie genres. The overall distribution of emotions is fairly consistent from genre to genre, but positive emotions such as joy and trust are slightly more prevalent in genres like romance, history, and TV movies, while negative emotions such as disgust are slightly more prevalent in horror.
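The commentary can be backed with a quick numeric summary; a sketch averaging each emotion's share across genres, using the emotions_by_genre frame built above:

```r
# Average share of each NRC emotion across genres, largest first
emotions_by_genre %>%
  group_by(sentiment) %>%
  summarise(mean_prop = mean(prop), .groups = "drop") %>%
  arrange(desc(mean_prop))
```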